Prosper Loan data provided by Udacity
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
This dataset contains loan information. I have little background in this domain. That could be good, in the sense that I will learn more, and bad, in the sense that my lack of domain knowledge could limit my findings.
As part of the Exploratory Data Analysis of this dataset, I would like to explore the dependencies between various variables. My intuition is that there are many hidden correlations between the variables which can only be found by exploration. Let us first load the dataset and look at the variables.
Here we explore the dataset as follows
## [1] "No of data points: 113937"
## [1] "No of features: 81"
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
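The structure listing above is the kind of output produced by `read.csv`, `names` and `str`. Below is a minimal sketch of those calls. Since the original file is not part of this report, it writes a tiny toy CSV to a temporary file first; the toy rows and the variable names `toy`, `csv_path` and `loans` are assumptions for illustration only.

```r
# Toy stand-in for the real CSV: four of the 81 columns, with the first
# three values that str() shows above. The real file is not loaded here.
toy <- data.frame(
  ListingNumber = c(193129L, 1209647L, 81716L),
  Term          = c(36L, 36L, 36L),
  LoanStatus    = c("Completed", "Current", "Completed"),
  BorrowerRate  = c(0.158, 0.092, 0.275)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# stringsAsFactors = TRUE gives the "Factor w/ ..." columns seen in str();
# it was the default before R 4.0 and must be requested explicitly since.
loans <- read.csv(csv_path, stringsAsFactors = TRUE)

print(paste("No of data points:", nrow(loans)))
print(paste("No of features:", ncol(loans)))
names(loans)
str(loans)
```

On the full dataset the same calls would report 113937 rows and 81 columns.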
Now that we have seen the variables, I would like to plot some of them one by one to see what insights they can give.
First, I feel the Prosper rating is the most important variable. The rating is a factor with discrete levels, whereas the score appears to be numeric. I want to know how many loans fall under each rating level. The bar chart shows which ratings are most common. The largest single group is the unclassified loans (blank rating, 29,084), and among the classified ones, ‘C’ is the most common rating and ‘AA’ the least. Let me check this with the table command in R to confirm that it agrees with the plot (that is, that my plot is correct :P).
##
## A AA B C D E HR
## 29084 14551 5372 15581 18345 14274 9795 6935
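The counts above can be turned back into a factor to show how the blank level might be treated as missing and how the ratings can be ordered by risk. This is a sketch, not the original chunk: the ordering AA (best) to HR (worst) and the names `rating_counts` and `rating_ordered` are my assumptions.

```r
# Counts copied from the table above; the first (unnamed) level is the blank rating.
rating_counts <- c(29084, 14551, 5372, 15581, 18345, 14274, 9795, 6935)
names(rating_counts) <- c("", "A", "AA", "B", "C", "D", "E", "HR")

# Rebuild one factor value per loan from the counts.
rating <- factor(rep(names(rating_counts), times = rating_counts))
table(rating)

# Re-level from best to worst; values not listed in `levels` (the blanks) become NA.
rating_ordered <- factor(rating, levels = c("AA", "A", "B", "C", "D", "E", "HR"))
table(rating_ordered, useNA = "ifany")

# Most and least common classified ratings.
names(which.max(table(rating_ordered)))
names(which.min(table(rating_ordered)))
```

Ordering the levels this way also makes later bar charts and box plots read from lowest to highest risk instead of alphabetically.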
Let me now explore the ProsperScore variable, which is another rating method. What are the maximum and average scores? According to the plot, the majority of scores are concentrated between 4 and 8. Let me verify this using a summary of the variable.
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
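As a cross-check, the summary statistics can be recomputed from the frequency table alone. The sketch below rebuilds the score vector from the counts printed above (the names `score_counts` and `scores` are mine) and calls `summary` on it.

```r
# Frequencies of scores 1..11, copied from the table above.
score_counts <- c(992, 5766, 7642, 12595, 9813, 12278, 10597, 12053, 6911, 4750, 1456)

# One value per loan, plus the 29084 loans without a score.
scores <- c(rep(1:11, times = score_counts), rep(NA, 29084))

summary(scores)
quantile(scores, c(0.25, 0.75), na.rm = TRUE)
```

The rebuilt vector has 113937 entries, and its quartiles, median and mean agree with the summary shown above.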
As we can see, the 1st and 3rd quartiles are 4 and 8 respectively, and the mean and median are both close to 6. So I can conclude that, on average, loans are given a score of about 6. Note that 29,084 loans have no score.
Next, I want to know how the LoanStatus variable is distributed. The status tells how many loans are completed, current, past due or defaulted, which might reveal patterns in how borrowers repay. I back the plot with a count for each status.
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
This is the distribution of loan status. As expected, most loans are in Current status, and the next largest group is Completed. No interesting findings though. :(
How about the employment status of people taking loans? I'm curious to know which groups take more loans!
##
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
Most borrowers are listed as Employed or Full-time. Very few unemployed people take loans. If my understanding is correct, loans are granted partly on the basis of employment status, that is, whether a person will be able to repay the loan, and I think this could be the reason for such a distribution.
Let me explore the income ranges of borrowers. Can the income range tell me something? Are specific income ranges targeted for loans?
But before plotting I would like to factor the variable for better distribution.
## Not employed $0 $1-24,999 $25,000-49,999 $50,000-74,999
## 806 621 7274 32192 31050
## $75,000-99,999 $100,000+ Not displayed
## 16916 17337 7741
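Factoring the variable with explicit levels is what puts the income bands in ascending order instead of alphabetical order. A minimal sketch on toy values follows; the level order is read off the table above, while the sample vector `income` is invented for illustration.

```r
# Hypothetical sample of raw IncomeRange values.
income <- c("$25,000-49,999", "$50,000-74,999", "Not displayed",
            "$25,000-49,999", "$100,000+", "$1-24,999", "Not employed", "$0")

# Level order as it appears in the table above (lowest to highest, then "Not displayed").
income_levels <- c("Not employed", "$0", "$1-24,999", "$25,000-49,999",
                   "$50,000-74,999", "$75,000-99,999", "$100,000+", "Not displayed")

income_f <- factor(income, levels = income_levels)
table(income_f)
```

Without the `levels` argument, R would sort the labels as strings, which places "$100,000+" before "$25,000-49,999" and makes the bar chart hard to read.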
Notice the borrower income range plot. Middle-income people take the most loans, with the highest concentration in the $25,000-$74,999 range. These could be customers who are targeted for home loans, car loans, etc., since middle-income people often want to buy a house or a car but cannot afford the full payment up front. A loan comes as a saviour for them.
But do people state their true income on the forms? How does stated income vary?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
I find a median stated monthly income of $4,667 and a mean of $5,608, which seems typical for a professional with a few years of experience. The peak of borrowers lies in the $3,000-$6,000 range. The maximum of $1,750,000 per month is an extreme outlier.
Now I would like to know how much debt a borrower carries relative to income, i.e. how does DebtToIncomeRatio vary?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The distribution is right-skewed with a long tail, as expected. Most people in the U.S. have a credit history, and the ratio should be low enough for repayment to be secure. Around 25% looks like a typical level for most borrowers: the median is 0.22 and the mean is 0.276.
Now I would like to know more about Prosper as a company. How did Prosper do throughout these years? This can be gauged from the number of loans people took each year. Let us look at how the data is distributed over time.
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 22 5906 11460 11552 2047 5652 11228 19553 34345 12172
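A per-year count like the one above can be obtained by parsing the origination date and tabulating the year. The sketch below assumes the timestamp format shown by `str()` earlier ("2005-11-15 00:00:00"); only the first toy value comes from that output, and the rest are made up.

```r
# Toy LoanOriginationDate strings; all but the first are hypothetical.
origination <- c("2005-11-15 00:00:00", "2009-03-02 00:00:00",
                 "2013-07-19 00:00:00", "2013-12-24 00:00:00",
                 "2014-03-03 00:00:00")

# Parse the date part; the trailing time is ignored by this format.
origination_date <- as.Date(origination, format = "%Y-%m-%d")

# Extract the year as a string and count loans per year.
origination_year <- format(origination_date, "%Y")
table(origination_year)
```

Because the column is loaded as a factor in the real data, it would need `as.character()` before parsing.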
As per the plot, the data covers the years 2005-2014. The largest number of loans originated in 2013. Notice that there are very few loans in 2009. This could be due to the market downturn that followed the 2008 Global Financial Crisis. One can see the number of loans rising after that year. It seems that although 2009 wasn't a good year, Prosper did a great job of making a comeback and making up for the loss! Interesting!
Coming back to the borrowers: I want to know what made people take the loans. What was their purpose? Since ListingCategory is stored as a numeric code, I first create the factor labels and then update the variable.
## Debt Consolidation Home Improvement Business
## 58308 7433 7189
## Personal Loan Student Use Auto
## 2395 756 2572
## Baby & Adoption Boat Cosmetic Procedure
## 199 85 91
## Engagement Ring Green Loans Household Expenses
## 217 59 1996
## Large Purchases Medical/Dental Motorcycle
## 876 1522 304
## RV Taxes Vacation
## 52 885 768
## Wedding Loans Other Not Available
## 771 10494 16965
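Turning the numeric codes into labels can be done with `factor(..., levels, labels)`. The code-to-label mapping below (0 = Not Available through 20 = Wedding Loans) is my recollection of the Prosper variable dictionary and should be treated as an assumption; the sample codes are the first ten values shown by `str()` earlier.

```r
# Assumed mapping for codes 0..20 (check against the data dictionary).
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby & Adoption",
  "Boat", "Cosmetic Procedure", "Engagement Ring", "Green Loans",
  "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"
)

# First ten ListingCategory..numeric. values from the str() output.
listing_code <- c(0, 2, 0, 16, 2, 1, 1, 2, 7, 7)

listing_category <- factor(listing_code, levels = 0:20, labels = category_labels)

# Drop unused levels so the toy table stays short.
table(droplevels(listing_category))
```

The table above lists ‘Other’ and ‘Not Available’ last, so the original analysis presumably reordered the levels after labelling them.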
We notice that many borrowers did not state a specific purpose: 16,965 listings are ‘Not Available’ and 10,494 are ‘Other’. There is a surprisingly large demand for debt consolidation. My intuition is that these are young people who are entering the real world and starting to repay student debt, buy cars, take out mortgages, etc. This is probably less true of people who are already settled.
Now that I know the purpose, what loan amounts do borrowers take?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
This histogram is really interesting! There are peaks at $5,000, $10,000, $15,000 and $20,000. This tells us that people tend to take loans in round amounts that are multiples of $5,000. The minimum loan amount was $1,000 and the maximum was $35,000.
It's interesting to look at the range of the borrower rate. The borrower rate may change with the term, so let me explore this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
I suspect there are different borrower rates for different terms and ratings, but overall we get a median of 18.4%. There is also a notable spike around 31%. I wonder what the reason could be.
Prosper is a US-based firm. Is there any specific region or state where more or fewer loans are taken? Let's explore the BorrowerState distribution to answer that.
##
## AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
The highest number of loans went to borrowers from CA, with a count of 14,717, and the lowest to borrowers from ND, with a count of 52. Another 5,515 loans have no state recorded.
I find that this dataset divides into two categories: one from the borrower's perspective and one from the investor's. Some variables give information about the borrower and some about the investor.
I would like to explore the LenderYield variable first. Is it related to the borrower information in any sense?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1242 0.1730 0.1827 0.2400 0.4925
Interestingly, given the peak near 0.31, there seems to be a direct relationship between the borrower rate (from the previous plot) and the lender yield.
What loan terms do borrowers choose?
##
## 12 36 60
## 1614 87778 24545
There are three terms: 12, 36 and 60 months (1, 3 and 5 years). Most loans are taken for a term of 36 months.
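To put a number on how dominant the 36-month term is, the counts from the table above can be converted into percentage shares. The names `term_counts` and `term_share` are mine.

```r
# Counts per term, copied from the table above.
term_counts <- c("12" = 1614, "36" = 87778, "60" = 24545)

# Share of each term in percent, rounded to one decimal place.
term_share <- round(100 * term_counts / sum(term_counts), 1)
term_share
```

Roughly 77% of all loans have a 36-month term, about 21.5% have a 60-month term, and only about 1.4% have a 12-month term.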
The dataset comprises 81 (original) variables with 113,937 observations. The variables are of classes int, numeric and factor (the date columns are read in as factors). The dataset includes loans from 2005 to 2014. Although 81 variables seems like too many at first, on a second look they fall into 2 main groups: the “Borrowers” variables and the “Investors” variables.
During the analysis, it became clear that the data has 2 main players: the “Borrowers” and the “Investors”. For borrowers, I believe the Prosper Score and Prosper Rating are the main indicators of borrower quality. For investors, I now understand that Lender Yield is the most important factor. I would like to explore these further in my bivariate analysis.
Analysing the year and month of the loan listing, along with borrowers' economic and financial status, can help support further investigation. The borrower's state could also provide interesting insights into locations where loans are more prevalent.
So far I haven't created any new variables, although I have converted a few variables to factors and changed the data formats of some others.
There are a few unusual distributions. There is a high spike in lender yield and borrower rate, and the spikes in LoanOriginalAmount suggest that people tend to borrow in round amounts. We also notice that most loans were taken for a period of 36 months, although I am unable to conclude why this happens. To tidy up the data, I converted some of the variables to factors, simply to visualize them better in categorical form.
Now that I have explored some individual variables, I would like to know their relationships with each other. We start with the bivariate plots next.
# Bivariate Plots Section
My intuition is that there is a relation between BorrowerRate and ProsperRating. Let me check this. I want to see the distribution of borrower rates for each rating. First I factorise the alpha variable.
Ignoring the unclassified loans, we notice a clear, roughly linear relationship between the Prosper rating and the borrower rate. As the rating moves down from one level to the next, the median borrower rate increases. The ‘A’ rating is an exception.
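The per-rating medians behind such a box plot can be computed with `tapply`. The sketch below uses invented rates, so it only illustrates the mechanics; the level order AA to HR is my assumption, and the blank rating is dropped as in the text.

```r
# Hypothetical ratings and borrower rates (not taken from the dataset).
toy <- data.frame(
  rating = c("AA", "AA", "A", "A", "B", "B", "HR", "HR", ""),
  rate   = c(0.07, 0.08, 0.10, 0.12, 0.15, 0.16, 0.31, 0.32, 0.20)
)

# Order the levels by risk; the blank rating is not listed and becomes NA.
toy$rating <- factor(toy$rating, levels = c("AA", "A", "B", "C", "D", "E", "HR"))

# Ignore unclassified loans, then take the median rate per remaining rating.
classified <- toy[!is.na(toy$rating), ]
median_rate <- tapply(classified$rate, droplevels(classified$rating), median)
median_rate
```

If the factor keeps its default alphabetical order, ‘A’ sorts before ‘AA’, which is one way a rating can look out of sequence in a plot, so it is worth checking the level order.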
Next, I want to know what the yearly income is for the different ratings.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 38400 56000 67300 81900 21000000
The above plot is a little unclear because of the many outliers. I would like to rebuild it on a subset of the data.
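The yearly figures above are consistent with multiplying StatedMonthlyIncome by 12 (1,750,000 × 12 = 21,000,000). The sketch below shows one way to derive that variable and cut off the extreme tail; the 99th-percentile cutoff is my assumption, since the report does not say which subset was used.

```r
# Toy monthly incomes: five values from the str() output, plus the
# minimum (0) and maximum (1750000) reported in the summary.
monthly <- c(3083, 6125, 2083, 2875, 9583, 0, 1750000)

# Derived yearly income.
yearly <- monthly * 12
summary(yearly)

# Keep positive incomes below the 99th percentile (assumed cutoff).
cutoff  <- quantile(yearly, 0.99)
trimmed <- yearly[yearly > 0 & yearly < cutoff]
summary(trimmed)
```

Trimming only changes what is plotted; the outliers remain in the data and should be kept in mind when reading the mean.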
This is much better. The plot helps us see that yearly income is related to the Prosper rating: the rating decreases for lower income levels. Again, the ‘A’ rating is an exception.
Next I would like to see the Correlation between DebtToIncomeRatio and BorrowerRate:
## [1] 0.06291678
:( The correlation is very weak (about 0.06), so there is no meaningful linear relationship between the two.
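Because DebtToIncomeRatio contains missing values (8554 NAs in the summary earlier), a plain `cor()` call returns NA unless incomplete pairs are dropped. The sketch below shows this on toy vectors; the numbers are illustrative and will not reproduce the 0.063 above.

```r
# Toy vectors with one missing ratio (values are illustrative only).
debt_to_income <- c(0.17, 0.18, 0.06, 0.15, 0.26, 0.36, 0.27, 0.24, 0.25, NA)
borrower_rate  <- c(0.158, 0.092, 0.275, 0.0974, 0.2085, 0.15, 0.20, 0.18, 0.12, 0.12)

# Without handling NAs, the result is NA.
cor(debt_to_income, borrower_rate)

# Use only the rows where both values are present.
r <- cor(debt_to_income, borrower_rate, use = "complete.obs")
r

# cor.test() also drops incomplete pairs and reports a p-value.
cor.test(debt_to_income, borrower_rate)$p.value
```

With more than 100,000 rows, even a correlation as small as 0.06 can come out statistically significant, so the size of the coefficient says more here than the p-value.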
Is there any relation between EmploymentStatus and loan amount? Strangely, the retired category seems to have a higher original loan amount than the part-time category. In fact, part-time employees have the lowest original loan amount. I am unable to explain why this happens. My guess is that retired people may need loans for medical expenses or other financial needs.
Is there any relation between BorrowerRate and Region(BorrowerState)? Not such an interesting plot for now.
Let's now explore the investors' variables. Just like borrower rate and Prosper rating, is there any relation between lender yield and rating? The plot is as expected: as ratings go down, the lender's yield increases.
Earlier I found that loans are taken in multiples of $5,000 and for terms of 12, 36 or 60 months. Now I'm curious whether the two are related. This chart offers an interesting new insight: although the majority of loans have a 36-month term, the original loan amount is significantly higher for the 60-month term.
I wanted to explore borrower and investor patterns and the variables related to them. So far, the only relationship I have found runs through the proprietary Prosper rating system. The other factors I tried to compare did not show any particular relationship.
Well, all plots were as expected, so nothing very interesting. One surprise was that the retired category seems to have a higher original loan amount than the part-time category.
Prosper rating is inversely related to both lender yield and borrower rate: the higher the rating, the lower the borrower rate and the lender yield.
Now that we have explored the pairwise relations, I would like to see whether there are any multivariate relations.
I feel the Prosper rating has a relation with DebtToIncomeRatio. What happens if we also include LenderYield? I would like to make a scatter plot here, taking a closer look at lender yield vs. Prosper rating and at how the Prosper rating is influenced by the debt-to-income ratio. The plot shows the relationship between Prosper rating and lender yield: the higher the risk, the lower the rating and the better the lender yield. We also notice that a high rating like AA does not come with a DebtToIncomeRatio above 25%. Although most borrowers have a low DebtToIncomeRatio, there are still borrowers with a high ratio, and they fall into the lower Prosper ratings. Therefore, the shape is an upward triangle. Quite intriguing!
Now I want to know whether there is a specific term for which the lender's yield is higher or lower! This is a closer look at lender yield vs. Prosper rating, split by the Term variable. The majority of loans have a 36-month term, and the returns for 36-month and 60-month loans are somewhat higher than for 12-month loans, bearing in mind that there are far fewer 12-month loans than loans of the other terms.
Now, is there a similar relation in the borrower's data? Prosper must have refined its model over the years: looking at borrowers year by year, the variation in borrower rate becomes less pronounced and the standard deviation tends to shrink year over year. Something worth noticing is that the amount of borrowing suddenly decreased in 2013.
The loan term is quite a good indicator of whether we get a better lender yield or not. We also see how the three variables Lender Yield, Prosper Rating and Debt To Income Ratio come together and how they affect each other.
There seems to be a fixed borrower rate for ratings HR and AA. This suggests that the eligibility criteria for AA and HR must be strict.
I did not create any specific models from dataset.
Here I put the most interesting findings of my dataset.
This is one of the most intriguing plots for me, as it shows the relationship between Prosper rating and lender yield: the higher the risk, the lower the rating and the better the lender yield. We also notice that a high rating like AA does not come with a DebtToIncomeRatio above 25%. Although most borrowers have a low DebtToIncomeRatio, there are still borrowers with a high ratio, and they fall into the lower Prosper ratings. Therefore, the shape is an upward triangle.
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 22 5906 11460 11552 2047 5652 11228 19553 34345 12172
This graph shows the downturn in business in 2009, most likely due to the recession that followed the 2008 Global Financial Crisis. Once conditions started to stabilize, loan volumes improved over the subsequent years, as can be seen from the graph.
## Debt Consolidation Home Improvement Business
## 58308 7433 7189
## Personal Loan Student Use Auto
## 2395 756 2572
## Baby & Adoption Boat Cosmetic Procedure
## 199 85 91
## Engagement Ring Green Loans Household Expenses
## 217 59 1996
## Large Purchases Medical/Dental Motorcycle
## 876 1522 304
## RV Taxes Vacation
## 52 885 768
## Wedding Loans Other Not Available
## 771 10494 16965
About a quarter of the loans had no specific purpose declared (‘Not Available’ or ‘Other’), and just over 50% of loans were for Debt Consolidation. These could be younger customers who start their working lives by taking home loans, car loans, etc., so my guess is that an older or more settled generation would be less common in this category.
This dataset seemed quite long and not so interesting. I was unable to draw many conclusions, as there was limited correlation between variables. My limited knowledge of the domain also seems to have affected the insights in my report. I tried to cover the major features in the dataset that I felt required attention. Other features seemed redundant or not so interesting to me right now.
I focused my attention on the dataset from two perspectives: borrowers and investors. I found some relations between the rates and yields which were very new to me. I was also able to draw some insights on the purpose of people taking loans.
As part of the future work, I would like to perform feature selection, extraction, in order to get better insights. I would also like to build a logistic regression model on this to predict some of the target features.